Model specification is an element of statistics whose importance is often glossed over. In most cases, there are multiple appropriate models for a set of data. The final choice of model can hugely impact the results, and classical methods offer limited guidance on how best to account for the uncertainty this creates. Unstructured searches and checks for the best model specification can lead to incorrect inferences, fragile reported findings, and publication bias [@Montgomery]. Frequentist approaches to model selection include, among others, the Akaike Information Criterion (AIC), the root mean squared error (RMSE), and \(R^2\). We will thus elaborate a little on AIC and two other common information criteria, BIC and WAIC.
Information criteria are commonly used to select the ‘best’ model from a set, or to quantify how much better one model is than another. They can be thought of as estimating how well a model would fit out-of-sample data when no such data is available. All of these criteria consist of a likelihood component, which measures how well the model fits the within-sample data, and a bias corrector, which accounts for the fact that models usually fit within-sample data better than out-of-sample data. For all information criteria, a lower score is better, and scores can only be directly compared to scores from the same criterion.
The Akaike Information Criterion (AIC) is the simplest information criterion, and the most commonly used:
\[AIC = -2 ln(L) + 2k\]
Where \(L\) is the model’s likelihood, and \(k\) is the number of parameters in the model. Because more parameters allow a model to fit within-sample data more closely, the parameter count is used as the bias corrector.
Given multiple candidate models, you can estimate the relative probability that model \(k\) minimizes information loss as follows, where \(AIC_{min}\) is the lowest AIC in the set of models:
\[P_k = exp\left(\frac{AIC_{min} - AIC_k}{2}\right)\]
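As a quick illustration in base R, with made-up AIC scores for three hypothetical models:

```r
# Relative likelihoods from toy AIC scores (the AIC values here are
# fabricated for illustration)
aic <- c(m1 = 102.3, m2 = 100.1, m3 = 107.8)

# P_k = exp((AIC_min - AIC_k) / 2); the minimizing model always scores 1
rel_lik <- exp((min(aic) - aic) / 2)
round(rel_lik, 3)
```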
AIC fits well in the frequentist frame. The likelihood component is assumed to come from the maximum likelihood estimates of the parameters, which Bayesian methods don’t select for, and point estimates for parameters are used instead of the full posterior distribution.
The misleadingly named Bayesian Information Criterion (BIC), is given by:
\[BIC = - 2ln(L) + k\;ln(n)\] Where \(k\) is the number of parameters in the model, \(n\) is the number of data points, and \(L\) is again the likelihood function.
BIC penalizes free parameters more heavily than AIC whenever \(ln(n) > 2\), i.e., for all but the smallest data sets. While AIC tries to select the model that best describes the data presented, BIC attempts to select the true model from among the model set. However, like AIC, BIC doesn’t naturally fit within the Bayesian framework, as it is scored by the likelihood function.
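A small sketch comparing the two criteria on the built-in mtcars data (the model formulas here are arbitrary choices for illustration):

```r
# AIC and BIC for two nested linear models; with n = 32 observations,
# ln(n) > 2, so BIC penalizes the larger model's extra parameters
# more heavily than AIC does
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_big   <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)

c(aic_small = AIC(fit_small), aic_big = AIC(fit_big))
c(bic_small = BIC(fit_small), bic_big = BIC(fit_big))
```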
One information criterion that fits naturally within the Bayesian framework is the Widely Applicable Information Criterion (WAIC):
\[WAIC = -2\sum_{i=1}^nlog\left(\frac{1}{S}\sum_{s=1}^S p(y_i\vert\theta^s) \right) +2\sum_{i=1}^nV_{s=1}^S(log(p(y_i\vert\theta^s)))\]
Where \(S\) is the number of posterior draws, \(\theta^s\) is the \(s\)-th posterior draw for parameters \(\theta\), and \(V_{s=1}^S(a_s)\) is the sample variance of \(a_s\).
The first half of WAIC uses the full posterior distribution of \(\theta\), as opposed to only the point likelihood estimates of \(\theta\) used in AIC and BIC. This means that WAIC is fully Bayesian. The second half is the bias corrector, which can be thought of as the effective number of unconstrained parameters, accounting for the complex ways Bayesian parameters interact [@Watanabe].
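A minimal sketch of the WAIC computation in base R, using fabricated data and posterior draws (every value here is an assumption for illustration only):

```r
# Build an S x n matrix of pointwise log-likelihoods from fake
# posterior draws, then apply the WAIC formula directly
set.seed(1)
S <- 1000; n <- 25
y <- rnorm(n)
theta <- rnorm(S, mean(y), 1 / sqrt(n))  # fabricated posterior draws

# Column i holds log p(y_i | theta^s) across all S draws
log_lik <- sapply(y, function(yi) dnorm(yi, theta, 1, log = TRUE))

lppd   <- sum(log(colMeans(exp(log_lik))))  # fit term (before the -2)
p_waic <- sum(apply(log_lik, 2, var))       # bias corrector
waic   <- -2 * lppd + 2 * p_waic
waic
```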
Bayesian Model Averaging (BMA) offers an alternative practice that helps ensure findings are robust to a variety of model specifications. At its simplest level, BMA assigns priors to potential model specifications and then calculates posterior distributions for the model itself, in addition to the coefficients within the specification. This is thus an extension of previous Bayesian theory that focuses solely on coefficient estimation.
Let us begin by considering an \(n \times p\) matrix \(X\) of all the potential independent variables to predict a response variable \(Y\). A standard linear analysis would assume \(Y = X \beta + \epsilon\), where \(\beta\) is a coefficient matrix and \(\epsilon \sim N(0, \sigma^2)\). There are \(q=2^p\) potential model specifications in the model space \(\{M_1, M_2, \dots, M_q\}\), and in many cases, there is ambiguity about which of these is best. BMA incorporates this uncertainty into the process rather than ignoring it and claiming that the final model is the only option. This leads to greater flexibility in the inferences drawn from the end results [@Montgomery].
Each model \(M_k\) encompasses the likelihood function \(L(Y|\beta_k, M_k)\) of the observed data \(Y\) in terms of a model-specific coefficient matrix \(\beta_k\) with a prior \(\pi(\beta_k|M_k)\). Both the likelihood and priors are conditional on a particular model [@Fragoso]. The posterior distribution for the model parameters is then \[\pi(\beta_k|Y, M_k) = \frac{L(Y|\beta_k, M_k) \; \pi(\beta_k | M_k)}{\int L(Y|\beta_k, M_k) \; \pi(\beta_k | M_k) \; d\beta_k}\]
The above denominator represents the marginal distribution of the observations over all parameter values specified in model \(M_k\). It is called the model’s marginal likelihood or model evidence and is denoted by \(\pi(Y|M_k)\). BMA now assumes a prior distribution over the model space describing the prior uncertainty over each model’s capability to accurately describe and/or predict the data. This is modeled as a probability density over the model space, with values \(\pi(M_k)\) for \(k = 1, 2, \dots, q\) [@Fragoso].
Then, the posterior probability of model specification \(M_k\) is \[\pi(M_k | Y) = \frac{\pi(Y | M_k) \; \pi(M_k)}{\sum_{j=1}^q \; \pi(Y|M_j)\; \pi(M_j)}\].
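To make this concrete, here is a sketch in base R that converts hypothetical log model evidences and a uniform model prior into posterior model probabilities:

```r
# Posterior model probabilities from fabricated log marginal
# likelihoods pi(Y | M_k) and a uniform prior over q = 3 models
log_evidence <- c(M1 = -120.4, M2 = -118.9, M3 = -125.0)  # assumed values
prior <- c(M1 = 1, M2 = 1, M3 = 1) / 3

# Shift by the max on the log scale to avoid underflow, then normalize
unnorm <- exp(log_evidence - max(log_evidence)) * prior
post_prob <- unnorm / sum(unnorm)
round(post_prob, 3)
```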
Traditional BMA is not without issues. For example, changing from one vague prior to another can significantly impact posterior model probabilities. And calculating model likelihoods becomes extremely difficult for anything but the simplest models.
As a result of these issues, a method called Pseudo-BMA was created. It has the same core idea as BMA, but instead of calculating model likelihoods, it uses Leave-One-Out cross validation (LOO-CV) [@Yao].
Similar to the information criteria, LOO-CV estimates how good a model is at predicting out-of-sample data. To do this, it fits the model on all but one data point, then checks how well the fitted model predicts the held-out point; the log of this predictive probability is recorded. This is repeated for every data point in the sample.
Put mathematically, this is:
\[ \sum_{i=1}^n\log(p(y_i\vert y_1,\dots, y_{i-1},y_{i+1},\dots, y_n)) \]
Where the term in the \(log\) is the probability of seeing \(y_i\) from a model trained on all the data excluding \(y_i\).
By adding a complicated bias term, this can be turned into an information criterion. Both AIC and WAIC will approach LOO-CV as sample size increases.
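A brute-force sketch of exact LOO-CV in base R for a simple Gaussian linear model on the built-in mtcars data (the formula mpg ~ wt is an arbitrary illustration, and this naive refit-n-times loop is exactly what PSIS-LOO avoids):

```r
# Exact leave-one-out log predictive density: refit the model n times,
# each time scoring the held-out observation under a normal density
loo_lpd <- function(data) {
  sum(sapply(seq_len(nrow(data)), function(i) {
    fit <- lm(mpg ~ wt, data = data[-i, ])          # fit without point i
    mu  <- predict(fit, newdata = data[i, ])        # predict point i
    dnorm(data$mpg[i], mean = mu, sd = summary(fit)$sigma, log = TRUE)
  }))
}

loo_lpd(mtcars)
```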
While exact LOO requires \(n\) model fits, one for each point in \(y\), Pseudo-BMA reduces the computational cost by reusing draws from the posterior distribution: LOO-CV is estimated with a technique called Pareto Smoothed Importance Sampling (PSIS). This also allows Pseudo-BMA to be applied to models whose parameters have already been fitted.
Given a dataset \(y\), models \(M_k\), and parameters \(\theta\), PSIS-LOO is an efficient way to approximate the expected log pointwise predictive density (ELPD) for model \(k\), which measures how good model \(k\) is at fitting out-of-sample data:
\[ \widehat{elpd}^k=\sum_{i=1}^n \log(\hat{p} (y_i | y_1,\dots, y_{i-1},y_{i+1},\dots, y_n, M_k)) \]
Where conditioning on \(M_k\) means using that model for parameter estimates.
Similar to the model probabilities computed from AIC, model weights are calculated by:
\[ w_k = \frac{exp(\widehat{elpd}^k)}{\sum_{k=1}^Kexp(\widehat{elpd}^k)} \]
In practice, one more step involving the Bayesian bootstrap is used to deal with additional bias in the estimate. The final model weight \(w_k\) is the approximate probability that model \(k\) is the true model.
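A sketch of this weighting in base R, using hypothetical ELPD estimates; note that this plain normalization omits the Bayesian-bootstrap step above, so it will not exactly match the output of loo's pseudobma_weights():

```r
# Converting ELPD estimates into pseudo-BMA weights; the elpd values
# are fabricated for illustration
elpd <- c(model_1 = -349.3, model_2 = -346.7,
          model_3 = -329.9, model_4 = -330.3)

w <- exp(elpd - max(elpd))  # subtract the max first for numerical stability
w <- w / sum(w)
round(w, 3)
```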
These model probabilities can be used in a variety of ways. For example, BMA allows for a direct combination of models to calculate combined estimations for parameters, which leads to lower risk predictions than a single model [@Fragoso]. In addition, BMA can be used in model selection by choosing the model with the highest posterior probability. We will focus on this latter application.
library(ggplot2)
library(dplyr)
library(rstan)
library(rstanarm)
library(bayesplot)
library(loo)
library(tidyr)
library(ggridges)
library(purrr)
library(gridExtra)
library(knitr)
To illustrate Pseudo-BMA in practice, we’ve chosen the famous candy dataset from the fivethirtyeight package:
candy <- fivethirtyeight::candy_rankings
names(candy)
## [1] "competitorname" "chocolate" "fruity" "caramel"
## [5] "peanutyalmondy" "nougat" "crispedricewafer" "hard"
## [9] "bar" "pluribus" "sugarpercent" "pricepercent"
## [13] "winpercent"
Plotting the top 20 candies by win percentage:
candy %>%
arrange(desc(winpercent)) %>%
head(20) %>%
ggplot() +
geom_col(aes(y = reorder(competitorname, winpercent), x = winpercent)) +
labs(x = "Percent of Head to Heads Won", y = NULL, title = "Top 20 Candies")
Plotting how many candies have each attribute:
candy %>%
summarise_if(is.logical, sum) %>%
pivot_longer(1:ncol(.), names_to = "type", values_to = "count") %>%
mutate(type_clean = c("Chocolate", "Fruity", "Caramel", "Peanuty or Almondy", "Nougat", "Crisped Rice Wafer", "Hard", "Bar", "Several Candies in One Bag")) %>%
ggplot(aes(y = reorder(type_clean, count), x = count)) +
geom_col() +
labs(y = NULL, x = NULL, title = "Number of Candies with each Attribute")
Plotting Price and Sugar Content vs Win Rate
winbycost <- candy %>%
ggplot(aes(x = pricepercent, y = winpercent)) +
geom_jitter() +
labs(x = "Relative Price", y = "Percent of Head to Heads Won", title = "Price versus Win Rate")
winbysugar <- candy %>%
ggplot(aes(x = sugarpercent, y = winpercent)) +
geom_jitter() +
labs(x = "Sugar Content (Percent)", y = "Percent of Head to Heads Won", title = "Sugar Content versus Win Rate") +
ylim(0, 100)
grid.arrange(winbycost, winbysugar, ncol = 2)
Win percent for each candy type:
candy %>%
select(winpercent, 2:10) %>%
pivot_longer(2:10, names_to = "type", values_to = "is_type") %>%
filter(is_type) %>%
select(-is_type) %>%
group_by(type) %>%
mutate(mean_win = mean(winpercent)) %>%
ungroup() %>%
ggplot(aes(x = winpercent, y = reorder(type, mean_win), height = stat(density))) +
geom_density_ridges(stat = "binline", bins = 20, scale = 0.95, draw_baseline = FALSE) +
labs(x = "Percent of Head to Heads Won", y = NULL, title = "Candy Type by Win Percent") +
scale_y_discrete(labels = rev(c("Crisped Rice Wafer", "Peanuty or Almondy", "Bar", "Chocolate", "Nougat", "Caramel", "Several Candies in One Bag", "Fruity", "Hard")))
We start by creating standard stan models; in this case, each model differs in which variables it includes. The posterior distributions are approximated via MCMC.
set.seed(454)
model_1 <- stan_glm(winpercent ~ sugarpercent,
data = candy,
family = gaussian,
chains = 4, iter = 2*5000, refresh = FALSE)
model_2 <- stan_glm(winpercent ~ sugarpercent + pricepercent,
data = candy,
family = gaussian,
chains = 4, iter = 2*5000, refresh = FALSE)
model_3 <- stan_glm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus,
data = candy,
family = gaussian,
chains = 4, iter = 2*5000, refresh = FALSE)
model_4 <- stan_glm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus + sugarpercent + pricepercent,
data = candy,
family = gaussian,
chains = 4, iter = 2*5000, refresh = FALSE)
model_list <- list(model_1 = model_1, model_2 = model_2, model_3 = model_3, model_4 = model_4)
Once we have the models, we can use tools such as trace plots and density overlays to assess their stability. The trace plot and overlay for model 1 are shown below:
mcmc_trace(model_1)
mcmc_dens_overlay(model_1)
While the trace plots show a fairly wide band, indicating some variation in the sampled values for each parameter, no flatlining is present. We can also see that all four chains returned similar distributions for each of the parameters in model 1 (Intercept, sugarpercent, sigma). Similar behavior is seen in the trace plots and density overlays for every model (see Appendix), so the models appear usable.
To assess if the structure of our models is reasonable, we can use pp_check() to compare simulated samples to the real values.
grid.arrange(pp_check(model_1) + labs(title = 'Model 1'), pp_check(model_2) + labs(title = 'Model 2'),
pp_check(model_3) + labs(title = 'Model 3'), pp_check(model_4) + labs(title = 'Model 4'))
All models seem to roughly fit the distribution of the data. There are a few noticeable outliers in every model, and each model’s simulations peak somewhat higher than the distribution of the actual data, but these issues are not severe enough to consider the models wrong.
As we are satisfied with these diagnostics, we can proceed with these models. Here are the coefficients:
model_1$coefficients
## (Intercept) sugarpercent
## 44.67007 11.81196
model_2$coefficients
## (Intercept) sugarpercent pricepercent
## 39.802262 6.722496 15.580509
model_3$coefficients
## (Intercept) chocolateTRUE fruityTRUE
## 35.2393393 19.6522822 10.0194746
## caramelTRUE peanutyalmondyTRUE nougatTRUE
## 3.3625741 9.9863058 2.3087578
## crispedricewaferTRUE hardTRUE barTRUE
## 8.8676580 -4.8126445 -0.5306398
## pluribusTRUE
## -0.1460981
model_4$coefficients
## (Intercept) chocolateTRUE fruityTRUE
## 34.7283883 19.5438496 9.1570800
## caramelTRUE peanutyalmondyTRUE nougatTRUE
## 2.2049288 9.9400899 0.7589168
## crispedricewaferTRUE hardTRUE barTRUE
## 8.7610339 -6.0805383 0.5256453
## pluribusTRUE sugarpercent pricepercent
## -0.8447881 9.1226100 -5.7884718
Once the models are fit, we calculate the Leave-One-Out expected log pointwise predictive density (ELPD LOO) for each model.
"Model 1 Loo"
## [1] "Model 1 Loo"
(loo_1 <- loo(model_1))$estimates
## Estimate SE
## elpd_loo -349.299296 5.8136277
## p_loo 2.708551 0.5166705
## looic 698.598592 11.6272553
"Model 2 Loo"
## [1] "Model 2 Loo"
(loo_2 <- loo(model_2))$estimates
## Estimate SE
## elpd_loo -346.734615 6.6778576
## p_loo 4.051253 0.9045871
## looic 693.469229 13.3557153
"Model 3 Loo"
## [1] "Model 3 Loo"
(loo_3 <- loo(model_3))$estimates
## Estimate SE
## elpd_loo -329.91156 5.818139
## p_loo 10.57069 1.541516
## looic 659.82312 11.636278
"Model 4 Loo"
## [1] "Model 4 Loo"
(loo_4 <- loo(model_4))$estimates
## Estimate SE
## elpd_loo -330.25141 5.930144
## p_loo 12.93009 1.862167
## looic 660.50283 11.860288
lpd_point <- cbind(
loo_1$pointwise[,"elpd_loo"],
loo_2$pointwise[,"elpd_loo"],
loo_3$pointwise[,"elpd_loo"],
loo_4$pointwise[,"elpd_loo"]
)
With the ELPD calculations, the models can be assigned weights that sum to one. As shown, model3 has the highest weight at 0.552, followed by model4. By this metric, model3 appears to be the best model of the four we created above.
(weights <- pseudobma_weights(lpd_point))
## Method: pseudo-BMA+ with Bayesian bootstrap
## ------
## weight
## model1 0.001
## model2 0.006
## model3 0.552
## model4 0.441
A new type of candy (new_candy) is created and stored in a data frame for prediction. A table of predictions from all models, titled predictions, is then created.
new_candy <- data.frame(chocolate = TRUE, fruity = TRUE, caramel = TRUE, peanutyalmondy = TRUE, nougat = TRUE, crispedricewafer = TRUE, hard = FALSE, bar = FALSE,
pluribus = FALSE, sugarpercent = 0.20, pricepercent = 0.9)
my_predict <- function(model) {
posterior_predict(model, newdata = new_candy)
}
make_df <- function(predictions, index) {
data.frame(winpercent_new = predictions[,1], model = index)
}
predictions <- map(model_list, my_predict) %>%
map2(1:4, make_df)
Using the weights that pseudobma_weights() calculated from the ELPD scores, we can generate the predictive distributions for the win percentage of our made-up candy, defined above.
sampled_pred <- predictions %>% #calculating sample predictions for each model
map2(weights, sample_frac) %>%
bind_rows() %>%
mutate(model = as.character(model))
sampled_pred %>%
ggplot(aes(x = winpercent_new, fill = model, color = model)) +
geom_density_ridges(alpha = 0.2, aes(y = model)) + labs(title = 'Posterior Predictive Distributions for our Generated Candy')
## Picking joint bandwidth of 3.32
Unsurprisingly, the different models provide slightly different predictions. The \(MAP\) winpercent for model1 and model2 appears to be closer to 45-50%, while model3 and model4 show a winpercent closer to 80%.
sampled_pred %>%
ggplot(aes(x = winpercent_new, fill = model, color = model)) +
geom_histogram(alpha = 0.9, position = "stack")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As a more visual demonstration of the weights, the contribution of each model is evident in the predictive distribution, with model3 contributing the most, followed by model4. Thus, it is not surprising that the pseudo-BMA model also has a \(MAP\) of around 80%, given the dominance of model3 in the weighted average.
As shown above, the pseudo-BMA model’s prediction relies heavily on models 3 and 4, with similar \(MAP\) values of around 80% for the win percentage of the made-up candy. To better understand the advantage of Pseudo-BMA, we can compare its predictions to those of model 3, the ‘best’ standalone model, using the ppc_intervals() function.
A sample was generated for this test.
set.seed(454)
sampled_candy <- sample_frac(candy, size = 0.2, replace = FALSE)
The mean predictions for the Pseudo-BMA model were obtained by a weighted average of all model predictions. Then Model 3’s predictions could be compared visually to the Pseudo-BMA model:
candy_predictions1 <- posterior_linpred(model_1, newdata = sampled_candy, transform = TRUE)
candy_predictions2 <- posterior_linpred(model_2, newdata = sampled_candy, transform = TRUE)
candy_predictions3 <- posterior_linpred(model_3, newdata = sampled_candy, transform = TRUE)
candy_predictions4 <- posterior_linpred(model_4, newdata = sampled_candy, transform = TRUE)
pred3 <- colMeans(candy_predictions3) #mean of model 3's draws, needed for plotting below
pseudo_predictions <- candy_predictions1*weights[1] + candy_predictions2*weights[2] + candy_predictions3*weights[3] + candy_predictions4*weights[4]
ppc_intervals(sampled_candy$winpercent,
yrep = pseudo_predictions, prob_outer = 0.95) +
ggplot2::scale_x_continuous(
labels = sampled_candy$competitorname,
breaks = 1:nrow(sampled_candy)
) +
xaxis_text(angle = 90, vjust = 1, hjust = 0) +
lims(y = c(5, 90)) + geom_point(aes(x = c(1:17), y = pred3), color = 'red') + labs(title ="Pseudo-BMA model vs Model 3 On 17 Randomly Sampled Candies",
subtitle = 'Red Dots are Model 3 average predictions')
Once again, unsurprisingly, the mean predictions for all 17 candies from model3 and the pseudo-BMA weighted average are fairly close. However, for Baby Ruth, Nestle Butterfinger, and Reese’s Peanut Butter Cup, the pseudo-BMA weighted average is slightly closer to the actual win percentage. In other cases, like Sour Patch Kids, Reese’s Miniatures, and Air Heads, model 3 appears to give better predictions.
# rbind(prediction_summary(y = sampled_candy$winpercent,
#                          yrep = colMeans(pseudo_predictions)),
#       prediction_summary(y = sampled_candy$winpercent,
#                          yrep = candy_predictions3))
comparison_results <- tibble(
model = c('Pseudo-BMA', 'Model 3'),
mae = c(22.25264, 24.55274),
mae_scaled = c(0.7003848,5.5119684),
within_50 = c(0.17647059, 0.05882353),
within_95 = c(0.2352941 , 0.1764706 )
)
kable(comparison_results)
| model | mae | mae_scaled | within_50 | within_95 |
|---|---|---|---|---|
| Pseudo-BMA | 22.25264 | 0.7003848 | 0.1764706 | 0.2352941 |
| Model 3 | 24.55274 | 5.5119684 | 0.0588235 | 0.1764706 |
While it is hard to tell visually which is better, prediction_summary() indicates it plainly. Thanks to the weighted average, Pseudo-BMA is more accurate than model 3: on average, its predictions were about 2 percentage points closer to a candy’s actual win percentage (MAE), and 23.5% of candies’ actual win percentages fell within the middle 95% of its predictive distribution, versus 17.6% for model 3.
The concept of taking a weighted average of the predictions from multiple models is not new, given the obvious benefits of increased flexibility and robustness. However, weighted averages are predominantly seen in frequentist approaches to modelling, and less so in the Bayesian approach. Bayesian Model Averaging, and more specifically Pseudo-BMA, is a computationally efficient method to evaluate each model’s predictive distribution, assigning an ELPD score to each model. These ELPD-based weights allow predictions from each model to be combined into a single predictive distribution, the Pseudo-BMA model. As the candy dataset shows, Pseudo-BMA outperformed the best single model we calculated, and could improve further as more models are considered.
mcmc_trace(model_2)
mcmc_dens_overlay(model_2)
mcmc_trace(model_3)
mcmc_dens_overlay(model_3)
mcmc_trace(model_4)
mcmc_dens_overlay(model_4)
Models graphed in that order:
candy_plotter(model_1)
candy_plotter(model_2)
candy_plotter(model_4)